Latency and Bandwidth Requirements of Massively Parallel Programs: FFT as a Case Study

Authors

  • Fabrizio Petrini
  • Marco Vanneschi
Abstract

Many theoretical models of parallel computation rest on overly simplistic assumptions about the performance of the interconnection network. For example, they assume constant latency for any communication pattern, or infinite bandwidth. This paper presents a case study based on the FFT transpose algorithm, which is mapped onto two families of scalable interconnection networks, the k-ary n-trees and the k-ary n-cubes. We analyze in depth the network behavior of a minimal adaptive algorithm for the k-ary n-trees and of three algorithms for the k-ary n-cubes, each offering an increasing degree of adaptivity: deterministic routing, a minimal adaptive routing based on Duato's methodology, and Chaos routing, a non-minimal adaptive cut-through version of hot-potato routing. The simulation results collected on topologies with up to 256 processing nodes show that the k-ary n-trees can efficiently support the basic FFT algorithm by factoring the personalized broadcast into a sequence of congestion-free steps. Though k-ary n-cubes are less favored in terms of bisection bandwidth, we can narrow the performance gap between the two interconnection networks by properly interleaving communication and computation. While in the presence of bandwidth-bound patterns the communication latency becomes difficult to predict, the global accepted network bandwidth converges to a fixed value after a stabilization period, though both adaptive algorithms on the cubes suffer from post-saturation problems.
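
The idea of factoring a personalized broadcast (an all-to-all exchange, as needed by the FFT transpose) into steps can be illustrated with a small sketch. The abstract does not give the schedule used in the paper, so the XOR pairing below is an assumption chosen for illustration, and the names `xor_schedule` and `is_permutation` are hypothetical. The sketch only shows that each step is a permutation (every node sends one block and receives one block); whether such a step is actually congestion-free depends on the topology and the routing algorithm.

```python
def xor_schedule(num_nodes):
    """Factor an all-to-all exchange into num_nodes - 1 pairwise steps.

    Illustrative assumption: in step s, node i sends its block to node i XOR s.
    Requires num_nodes to be a power of two.
    Returns a list of steps; step[src] is the destination of node src.
    """
    if num_nodes < 2 or num_nodes & (num_nodes - 1):
        raise ValueError("num_nodes must be a power of two >= 2")
    return [[src ^ step for src in range(num_nodes)]
            for step in range(1, num_nodes)]


def is_permutation(dests):
    """True if every node appears exactly once as a destination."""
    return sorted(dests) == list(range(len(dests)))


if __name__ == "__main__":
    # Print the destination list of every step for 8 nodes.
    for step_number, dests in enumerate(xor_schedule(8), start=1):
        print(step_number, dests, is_permutation(dests))
```

Taken together, the steps cover every ordered pair of distinct nodes exactly once, so the sequence is equivalent to the full personalized broadcast.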

Similar resources

Using WPT as a New Method Instead of FFT for Improving the Performance of OFDM Modulation

Orthogonal frequency division multiplexing (OFDM) is used to provide immunity against very hostile multipath channels in many modern communication systems. The OFDM technique divides the total available frequency bandwidth into several narrow bands. In conventional OFDM, the FFT algorithm is used to provide orthogonal subcarriers. Intersymbol interference (ISI) and intercarrier interferen...

A Performance Study of Two-Phase I/O

Massively parallel computers are increasingly being used to solve large, I/O intensive applications in many different fields. For such applications, the I/O subsystem represents a significant obstacle in the way of achieving good performance. While massively parallel architectures do, in general, provide parallel I/O hardware, this alone is not sufficient to guarantee good performance. The problem is...

Real-Time Parallel Software Design Case Study: Implementation of the RT_2DFFT Benchmark on the MasPar MP-X Architecture

We extended and tested the MITRE real-time embedded scalable high performance computing benchmarking concept by implementing the RT_2DFFT benchmark on the MasPar MP-X series of massively parallel processors (MPPs). The RT_2DFFT benchmark specifies a symmetric two-dimensional fast Fourier transform (FFT) within a real-time software test bench. The test bench provides the realistic stimulus for t...

A Massively Parallel Multithreaded Architecture: DAVRID

MPAs (Massively Parallel Architectures) should address two fundamental issues for scalability: synchronization and communication latency. Dataflow architectures cause problems of excessive synchronization costs and inefficient execution of sequential programs while they offer the ability to exploit massive parallelism inherent in programs. In contrast, MPAs based on von Neumann computational model ...

Tera-Scale 1D FFT with Low-Communication Algorithm and Intel®

This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 tflops with 512 nodes, which is 1.5× than achievable on a same ...

Publication date: 1996